Skip to content

Conversation

adriangb
Copy link
Contributor

Closes #16545

@github-actions github-actions bot added optimizer Optimizer rules core Core DataFusion crate labels Jul 23, 2025
@adriangb
Copy link
Contributor Author

@theirix could you take a look? Are there any other expressions we should dissallow?

@alamb are there any existing APIs to get "volatility" from a PhysicalExpr? If not should we add some?

@theirix
Copy link
Contributor

theirix commented Jul 23, 2025

Thank you! I'll check my cases.

@theirix
Copy link
Contributor

theirix commented Jul 23, 2025

Thank you, @adriangb ! I can confirm that it works great with the table sampling, since I use random function (matched by name):

query TT
EXPLAIN SELECT COUNT(*) from t TABLESAMPLE 42 WHERE a < 10;
----
logical_plan
01)Projection: count(Int64(1)) AS count(*)
02)--Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
03)----Projection:
04)------Filter: t.a < Int32(10) AND random() < Float64(0.42)
05)--------TableScan: t projection=[a]
physical_plan
01)ProjectionExec: expr=[count(Int64(1))@0 as count(*)]
02)--AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))]
03)----CoalescePartitionsExec
04)------AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))]
05)--------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
06)----------ProjectionExec: expr=[]
07)------------CoalesceBatchesExec: target_batch_size=8192
08)--------------FilterExec: a@0 < 10 AND random() < 0.42
09)----------------DataSourceExec: partitions=1, partition_sizes=[1]

The volatile filter is not pushed to the datasource. Without this patch, it looked like predicate=random() < 0.1.

I agree it'd be more scalable to have an abstract way to specify UDF volatility.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @adriangb and @theirix -- this looks good to me. I think we need to change the volatility check but otherwise this is good to go

It is probably reasonable to add a pub fn volatility() method to PhysicalExpr as well so that user defined expressions that are volatile / shouldn't be pushed down can be excluded as well

expr.as_any()
.downcast_ref::<datafusion_physical_expr::ScalarFunctionExpr>()
{
let name = scalar_function.fun().name();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@github-actions github-actions bot added documentation Improvements or additions to documentation sql SQL Planner development-process Related to development process of DataFusion logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt) substrait Changes to the substrait crate catalog Related to the catalog crate common Related to common crate proto Related to proto crate functions Changes to functions implementation datasource Changes to the datasource crate ffi Changes to the ffi crate physical-plan Changes to the physical-plan crate spark labels Jul 28, 2025
adriangb added 4 commits July 28, 2025 12:33
* checkpoint: Address PR feedback in https://github.com/apach...

* add FilteredVec to consolidate handling of filter / remap pattern
@github-actions github-actions bot removed documentation Improvements or additions to documentation sql SQL Planner development-process Related to development process of DataFusion logical-expr Logical plan and expressions labels Jul 28, 2025
@github-actions github-actions bot removed physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt) substrait Changes to the substrait crate catalog Related to the catalog crate common Related to common crate proto Related to proto crate functions Changes to functions implementation datasource Changes to the datasource crate ffi Changes to the ffi crate physical-plan Changes to the physical-plan crate spark labels Jul 28, 2025
@adriangb adriangb requested a review from alamb July 28, 2025 17:33
@adriangb adriangb changed the title dissallow pushdown of volatile PhysicalExprs disallow pushdown of volatile functions Jul 28, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @adriangb and @theirix

The only thing I think we should try and have is some way to test this "end to end" slt test. Here is a proposal for your consideration:

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jul 28, 2025
@alamb alamb merged commit 94e8548 into apache:main Jul 29, 2025
27 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 29, 2025

🚀

Standing-Man pushed a commit to Standing-Man/datafusion that referenced this pull request Aug 4, 2025
* dissallow pushdown of volatile PhysicalExprs

* fix

* add FilteredVec helper to handle filter / remap pattern (apache#34)

* checkpoint: Address PR feedback in https://github.com/apach...

* add FilteredVec to consolidate handling of filter / remap pattern

* lint

* Add slt test for pushing volatile predicates down (apache#35)

---------

Co-authored-by: Andrew Lamb <[email protected]>
LiaCastaneda pushed a commit to DataDog/datafusion that referenced this pull request Sep 2, 2025
* dissallow pushdown of volatile PhysicalExprs

* fix

* add FilteredVec helper to handle filter / remap pattern (#34)

* checkpoint: Address PR feedback in https://github.com/apach...

* add FilteredVec to consolidate handling of filter / remap pattern

* lint

* Add slt test for pushing volatile predicates down (#35)

---------

Co-authored-by: Andrew Lamb <[email protected]>
(cherry picked from commit 94e8548)
LiaCastaneda added a commit to DataDog/datafusion that referenced this pull request Sep 9, 2025
* Enable physical filter pushdown for hash joins (apache#16954)

(cherry picked from commit b10f453)

* Add ExecutionPlan::reset_state (apache#17028)

* Add ExecutionPlan::reset_state

Co-authored-by: Robert Ream <[email protected]>

* Update datafusion/sqllogictest/test_files/cte.slt

* Add reference

* fmt

* add to upgrade guide

* add explain plan, implement in more plans

* fmt

* only explain

---------

Co-authored-by: Robert Ream <[email protected]>

* Add dynamic filter (bounds) pushdown to HashJoinExec (apache#16445)

(cherry picked from commit ff77b70)

* Push dynamic pushdown through CooperativeExec and ProjectionExec (apache#17238)

(cherry picked from commit 4bc0696)

* Fix dynamic filter pushdown in HashJoinExec (apache#17201)

(cherry picked from commit 1d4d74b)

* Fix HashJoinExec sideways information passing for partitioned queries (apache#17197)

(cherry picked from commit 64bc58d)

* disallow pushdown of volatile functions (apache#16861)

* dissallow pushdown of volatile PhysicalExprs

* fix

* add FilteredVec helper to handle filter / remap pattern (#34)

* checkpoint: Address PR feedback in https://github.com/apach...

* add FilteredVec to consolidate handling of filter / remap pattern

* lint

* Add slt test for pushing volatile predicates down (#35)

---------

Co-authored-by: Andrew Lamb <[email protected]>
(cherry picked from commit 94e8548)

* fix bounds accumulator reset in HashJoinExec dynamic filter pushdown (apache#17371)

---------

Co-authored-by: Adrian Garcia Badaracco <[email protected]>
Co-authored-by: Robert Ream <[email protected]>
Co-authored-by: Jack Kleeman <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Physical plan pushdown for volatile predicates
3 participants